Abstract: Information criteria are an appropriate and widely
used tool for solving model selection problems. However, they can
be used in different ways, each leading to a more or less precise
approximation of the sought model. In this paper, we present
two ways of using information criteria: the classical one, which
is generally used, and an alternative one, which is more precise
but requires slightly more computation. These methods
are compared on 1-D and 2-D autoregressive models; we use
a synthesized process for the 1-D case and texture images for
the 2-D case. We also work with the original $\varphi_\beta$ criterion,
which includes all the other usual criteria such as AIC, BIC, and $\varphi$.

I. Introduction 

An observation $x^n = (x_1, \dots, x_n)$ of a stochastic process
$X$ and a parametric family of probability density functions
$\{f(\cdot|\theta), \theta \in \Theta\}$ being given, the Maximum Likelihood (ML)
method makes it possible to estimate a parameter $\theta \in \Theta$ fitting the
observation. However, the problem of model selection is of
greater interest. Examples include the determination of
the number of components of a mixture law, the order of an
autoregression [6], [3], or of a Multiple Markov Chain [14].

Unfortunately, for this problem, the ML method fails and
overestimates the sought model. This is mainly due to the fact
that there exists in $\Theta$ a parameter giving a high probability to
the observation, even though that parameter may have many
components. This is typically the case for an observation of
length $n$ of a Multiple Markov Chain, which may always be
given probability 1 if we suppose that its order is $n - 1$.

An alternative method to ML is given by Information
Criteria (IC). They are written in the general form IC =
$-\log(\mathrm{ML})$ + Pen, where Pen is a penalty term growing as
the parameter becomes more complex. Since the term $-\log(\mathrm{ML})$
varies in the opposite direction, the minimization of IC achieves a
compromise between the data fit and the complexity of the
chosen parameter. Applications of those criteria are numerous,
in signal processing as well as in pattern recognition [3].

Different kinds of penalties have been suggested. Based upon the
minimization of a Kullback risk, Akaike [1] introduced the first
criterion, AIC; Schwarz [13] then suggested the BIC criterion
using Bayesian estimation. Next, Rissanen used notions of
coding and stochastic complexity [11], [12] to justify a criterion
which has asymptotically the same expression as BIC. Continuing
the work of Rissanen, El-Matouat and Hallin [5]
introduced the family of criteria $\varphi_\beta$. Note that the criterion $\varphi$
given by Hannan and Quinn in [6] predates $\varphi_\beta$ and is its limit
case for $\beta = 0$. In a general framework, Nishii [7] gave sufficient
conditions on the penalty for those criteria to be weakly or
strongly consistent.

In Section II, the problem of model selection is set out,
as well as the general method of using IC, which
requires too many computations. Subsections III-A and III-B
describe the two methods we study: the classical method and
the alternative method. The classical one, widely used, is based
upon embedded models; it has the advantage of requiring few
computations but only gives a rough approximation of the true
model. The alternative one, referred to as the "Nishii method", is
presented by Nishii, Zhao et al. [15], [7], [8] and allows a
more precise selection of the model at the cost of slightly
more computations. To our knowledge, this method is not often
used but deserves attention. In Section IV we compare the two
methods in the case of 1-D and 2-D autoregressive models. Only
the $\varphi_\beta$ criterion will be used since it includes the AIC, BIC, and
$\varphi$ criteria.

II. Model selection by IC 

Let $\{\Omega^n, \mathcal{A}^n, f(\cdot|\theta), \theta \in \Theta\}$ be a statistical structure, where
$\Theta$ is a subset of $\mathbb{R}^m$ and $x^n = (x_1, \dots, x_n)$ a realisation of
the unknown density $f(\cdot|\theta^*)$. We choose a reference parameter
$\theta^0 = (\theta^0_1, \dots, \theta^0_m) \in \Theta$, usually the null vector. Let us denote
by $S^*$ the support of $\theta^*$:

$$S^* = \{ j \in [\![1,m]\!] \mid \theta^*_j \neq \theta^0_j \}$$

where $[\![1,m]\!]$ is the set of integers $\{1, \dots, m\}$. For any support
$S$ we denote by $\Theta_S$ the set of parameters whose support is $S$.

Selecting the model means determining, from $x^n$, the support
$S^*$. Once a support $S$ is chosen, the unknown parameter
is estimated in the ML sense in $\Theta_S$.

Information Criteria are an appropriate tool for selecting the
support. For $S \subset [\![1,m]\!]$, they have the general form:

$$\mathrm{IC}(S) = -2\log f(x^n|\hat\theta_S) + |S|\,\alpha(n) \qquad (1)$$

where $|S|$ is the cardinality of $S$ and $\hat\theta_S$ is estimated in the ML
sense in $\Theta_S$. The penalties $\alpha(n)$ for the criteria we use are:

- AIC criterion, $\alpha(n) = 2$
- BIC criterion, $\alpha(n) = \log n$                          (2)
- $\varphi_\beta$ criterion, $\alpha(n) = n^\beta \log\log n$

For a fixed $n$, adjusting the value of $\beta$ in the penalty function
(2) of the $\varphi_\beta$ criterion recovers the other criteria:

$$\beta_{AIC} = (\log 2 - \log\log\log n)/\log n$$
$$\beta_{BIC} = (\log\log n - \log\log\log n)/\log n \qquad (3)$$

Consequently, we will only use the $\varphi_\beta$ criterion for $\beta$ ranging
from 0 to 1; $\beta = 0$ corresponds to the $\varphi$ criterion. Moreover,
in [9] the following bounds on $\beta$ are proposed:

$$\beta_{min} = \frac{\log\log n}{\log n} \le \beta \le 1 - \beta_{min} = \beta_{max} \qquad (4)$$
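
To make the link between the penalties (2) and the values (3) and (4) concrete, the following Python sketch evaluates the penalty $n^\beta \log\log n$ and the four reference values of $\beta$. The function names are ours and purely illustrative; the sketch only assumes the form of the $\varphi_\beta$ penalty stated in (2).

```python
import math


def phi_beta_penalty(n: int, beta: float) -> float:
    """Penalty alpha(n) = n**beta * log(log(n)) of the phi_beta criterion."""
    return n ** beta * math.log(math.log(n))


def beta_aic(n: int) -> float:
    """Value of beta for which the phi_beta penalty equals 2 (AIC)."""
    return (math.log(2) - math.log(math.log(math.log(n)))) / math.log(n)


def beta_bic(n: int) -> float:
    """Value of beta for which the phi_beta penalty equals log(n) (BIC)."""
    return (math.log(math.log(n)) - math.log(math.log(math.log(n)))) / math.log(n)


def beta_min(n: int) -> float:
    """Lower bound of (4): log(log(n)) / log(n)."""
    return math.log(math.log(n)) / math.log(n)


def beta_max(n: int) -> float:
    """Upper bound of (4): 1 - beta_min."""
    return 1.0 - beta_min(n)


if __name__ == "__main__":
    n = 1000
    for name, b in [("AIC", beta_aic(n)), ("BIC", beta_bic(n)),
                    ("min", beta_min(n)), ("max", beta_max(n))]:
        print(f"beta_{name} = {b:.4f}, penalty = {phi_beta_penalty(n, b):.4f}")
```

For $n = 1000$ the four values come out in the order $\beta_{AIC} < \beta_{BIC} < \beta_{min} < \beta_{max}$, and $\beta_{AIC}$ becomes negative from $n = 1619$ onwards, since $\log\log n$ then exceeds 2.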

It has been shown empirically in several contexts that, for a
classical use of IC (see Section III-A), the value $\beta_{min}$ often
gives the best results; however, the authors have not yet
established a theoretical justification of this result. In
our simulations, we also present the value $\beta_{max}$ even though it
gives poor results in most cases.

The selection of the support is then done via the minimization
of $\mathrm{IC}(S)$ among all supports:

$$\hat S = \arg\min\{\mathrm{IC}(S) \mid S \subset [\![1,m]\!]\} \qquad (5)$$
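
A direct implementation of (5) could look like the sketch below. It assumes a user-supplied function `neg2loglik(S)` returning $-2\log f(x^n|\hat\theta_S)$; that function, the toy `GAIN` table and all names are hypothetical and only serve to illustrate the exhaustive search over the $2^m$ supports.

```python
from itertools import combinations
from typing import Callable, FrozenSet


def exhaustive_support(m: int,
                       neg2loglik: Callable[[FrozenSet[int]], float],
                       alpha: float) -> FrozenSet[int]:
    """Minimise IC(S) = neg2loglik(S) + |S| * alpha over all S in {1..m}."""
    best_support, best_ic = frozenset(), float("inf")
    for size in range(m + 1):
        for subset in combinations(range(1, m + 1), size):
            support = frozenset(subset)
            ic = neg2loglik(support) + len(support) * alpha
            if ic < best_ic:
                best_support, best_ic = support, ic
    return best_support


# Hypothetical toy fit term: freeing index j lowers -2 log-likelihood by GAIN[j].
GAIN = {1: 30.0, 2: 0.5, 3: 20.0, 4: 1.0}


def toy_neg2loglik(support: FrozenSet[int]) -> float:
    return 100.0 - sum(GAIN[j] for j in support)


if __name__ == "__main__":
    # With alpha = 2 only indexes whose gain exceeds 2 are worth keeping.
    print(sorted(exhaustive_support(4, toy_neg2loglik, alpha=2.0)))  # [1, 3]
```

The loop visits every subset, which is what makes this general method impractical as soon as $m$ grows.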

A criterion is said to be strongly consistent if $\hat S$ converges almost
surely (a.s.) to $S^*$ as $n \to \infty$; it is said to be weakly consistent if
the convergence only holds in probability. Using the conditions
of Nishii [7], in the case of a product statistical structure, the
BIC and $\varphi_\beta$ criteria, $0 < \beta < 1$, are strongly consistent.
Those results are extended to the linear regression model,
including the autoregressive models used here, by Nishii et
al. in [8]. Those conditions hold with the BIC and $\varphi_\beta$ criteria for
the two methods we will discuss: $\hat k$ defined by (6) and (7)
converges a.s. to $k^*$ and $\hat S$ defined by (8) converges a.s. to
$S^*$.

The method (5) answers the problem of model selection
but requires many computations; see Table I for details. Here,
we study two lighter methods.

III. The studied methods 
A. Classical method 

Let us take $m$ nested subsets of $\Theta$: $\Theta_1 \subset \dots \subset \Theta_m \subset \Theta$,
called models of order $k \in [\![1,m]\!]$; for example $\Theta_k = \mathbb{R}^k$.

The problem is then restricted to the determination, from
$x^n$, of the order $k^*$ of the smallest model $\Theta_{k^*}$ containing the
unknown parameter $\theta^*$. To this end, we set

$$\mathrm{IC}(k) = -2\log f(x^n|\hat\theta_k) + |\Theta_k|\,\alpha(n) \qquad (6)$$

where $\hat\theta_k$ is estimated in the ML sense in the model $\Theta_k$ and
$|\Theta_k|$ is the number of free components of this model.

The selection of the order is done via the minimization of
$\mathrm{IC}(k)$ over $k$:

$$\hat k = \arg\min\{\mathrm{IC}(k) \mid k \in [\![1,m]\!]\} \qquad (7)$$

This method requires the fewest operations (see Table I for
details) but does not solve the problem of determining
the support $S^*$.

B. Nishii method 

A reference parameter $\theta^0 = (\theta^0_1, \dots, \theta^0_m) \in \Theta$ is fixed.
Using the notation of (1), we set $\mathrm{IC}_{ref} = \mathrm{IC}([\![1,m]\!])$. This is
the reference value of the criterion, computed on the model $\Theta_m$
where all components are free. Then, for $j \in [\![1,m]\!]$ we set
$\mathrm{IC}(-j) = \mathrm{IC}([\![1,m]\!] \setminus \{j\})$, the value of the criterion computed
on the model where all components are free except the $j$-th,
which is frozen to $\theta^0_j$, generally 0. The Nishii method consists
in choosing as an estimate of the support the set of indexes:

$$\hat S = \{ j \in [\![1,m]\!] \mid \mathrm{IC}(-j) > \mathrm{IC}_{ref} \} \qquad (8)$$

Those are the important indexes in the sense that the criterion
prefers the full model to the model where the $j$-th
component is frozen.
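
The rule (8) can be sketched in a few lines of Python. As an assumption for illustration, the fit term $-2\log f(x^n|\hat\theta_S)$ is supplied by a hypothetical function `neg2loglik(S)`; the toy `GAIN` table and the function names are ours and not part of the method itself.

```python
from typing import Callable, FrozenSet, Set


def nishii_support(m: int,
                   neg2loglik: Callable[[FrozenSet[int]], float],
                   alpha: float) -> Set[int]:
    """Keep index j when freezing it makes the criterion worse than IC_ref."""
    full = frozenset(range(1, m + 1))
    ic_ref = neg2loglik(full) + len(full) * alpha      # full model
    kept = set()
    for j in sorted(full):
        reduced = full - {j}                           # j-th component frozen
        ic_minus_j = neg2loglik(reduced) + len(reduced) * alpha
        if ic_minus_j > ic_ref:
            kept.add(j)
    return kept


# Hypothetical toy fit term: freeing index j lowers -2 log-likelihood by GAIN[j].
GAIN = {1: 30.0, 2: 0.5, 3: 20.0, 4: 1.0}


def toy_neg2loglik(support: FrozenSet[int]) -> float:
    return 100.0 - sum(GAIN[j] for j in support)


if __name__ == "__main__":
    print(sorted(nishii_support(4, toy_neg2loglik, alpha=2.0)))  # [1, 3]
```

Only $m + 1$ values of the criterion are evaluated: the reference one and one per frozen index. In this additive toy example an index is kept exactly when its gain exceeds the penalty; real likelihoods are not additive in general.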

For a brief comparison of the different methods in terms
of computations, let us suppose that each model of order
$k \in [\![1,m]\!]$ in III-A has dimension $k$. Table I gives the
number of operations required to solve the model selection
problem, each computation of an IC being weighted by the
dimension in which it has to be done, e.g. 2 computations in
dimension 5 count for 10 operations.

Table I: comparison in terms of required operations

Method:      General (5)    Classical (7)    Nishii (8)
Selection:   Support        Order            Support
Operations:  $m2^{m-1}$     $m(m+1)/2$       $m^2$
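
The operation counts can be checked with a short Python sketch that applies the weighting rule stated above (one computation in dimension $d$ counts for $d$ operations). The function names are ours; the count for the Nishii method is derived here from that rule, as one computation in dimension $m$ plus $m$ computations in dimension $m - 1$.

```python
from math import comb


def ops_general(m: int) -> int:
    """All supports S of {1..m}: each IC(S) costs |S| operations."""
    return sum(size * comb(m, size) for size in range(m + 1))


def ops_classical(m: int) -> int:
    """Nested models of orders 1..m: IC(k) costs k operations."""
    return sum(range(1, m + 1))


def ops_nishii(m: int) -> int:
    """One IC in dimension m (reference) plus m ICs in dimension m - 1."""
    return m + m * (m - 1)


if __name__ == "__main__":
    m = 20
    print(ops_general(m), ops_classical(m), ops_nishii(m))  # 10485760 210 400
```

For $m = 20$, the maximal order used in the 1D experiment of Section IV, the general search is several orders of magnitude more expensive than either of the two lighter methods.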



IV. Application in the autoregression case 

Let us recall the expression of Gaussian autoregressive (AR)
models in $d$ dimensions:

$$X_t = -\sum_{i \in S} a_i X_{t-i} + E_{S,t} \qquad (9)$$

where $S \subset \mathbb{Z}^d$ is the set of indices associated with the regression and
$E_S = (E_{S,t})_{t \in \mathbb{Z}^d}$ is a Gaussian white noise with variance $\sigma_S^2$.



A. One-dimensional autoregression

1) Presentation: In 1D, the classically used support $S$ of
the model is of the form $[\![1,k]\!]$, defining the model of order
$k$, called $\Theta_k$ (see III-A). As $\theta_k = \{a_k, \sigma_k^2\}$, where
$a_k = (a_1, \dots, a_k)$ and $\sigma_k^2$ is the variance of the associated Gaussian
white noise, $|\Theta_k| = k + 1$ while $|\Theta_S| = |S| + 1$. Selecting the
order of the model (see III-A) means finding $k^*$, while selecting
the support for $\theta^0 = 0$ (see III-B) means finding the indexes
$j \in [\![1,m]\!]$ for which $a_j \neq 0$, $m$ being the maximum value of
the order.

The Yule-Walker equations allow the parameters to be estimated
in the ML sense, and it is known that minus twice the maximal
log-likelihood is equal to $n(\log(2\pi\hat\sigma^2) + 1)$. Dropping terms which
do not depend on $k$ or $S$, the expressions (6) and (1) of the
criteria respectively become:

$$\mathrm{IC}(k) = n\log\hat\sigma_k^2 + k\,\alpha(n)$$
$$\mathrm{IC}(S) = n\log\hat\sigma_S^2 + |S|\,\alpha(n)$$

where $\hat\sigma_k^2$ is the estimated variance assuming the model of
order $k$, and $\hat\sigma_S^2$ the one estimated supposing the support is $S$.
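
As a reminder of where the $n\log\hat\sigma^2$ term comes from, here is a short derivation. It is a sketch that assumes Gaussian innovations and the conditional likelihood (initial values ignored), with $\hat e_t$ denoting the prediction residuals of the fitted model; these notations are ours.

```latex
\begin{aligned}
-2\log f(x^n \mid \hat a, \hat\sigma^2)
  &= n\log(2\pi\hat\sigma^2) + \frac{1}{\hat\sigma^2}\sum_{t=1}^{n}\hat e_t^{\,2},
  \qquad \hat\sigma^2 = \frac{1}{n}\sum_{t=1}^{n}\hat e_t^{\,2} \\
  &= n\log(2\pi\hat\sigma^2) + n
   = n\log\hat\sigma^2 + n\,(\log 2\pi + 1).
\end{aligned}
```

The term $n(\log 2\pi + 1)$ is the same for every candidate model, and so is the penalty attached to the variance parameter, which is why only $n\log\hat\sigma^2$ and the number of AR coefficients remain in the criteria.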
A realisation of that process being given, we may apply
the two methods (7) and (8) discussed above. Typically, if
$a = (-1, 0, 1)$, we expect the classical method to choose order
$\hat k = 3$ and the Nishii method to choose support $\hat S = \{1, 3\}$.

We generate 100 observations of an AR process (9) of
order 15 and parameters

$a = (0.5, 0.4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.45)$, $\sigma^2 = 1$

and for each of these observations, we solve the model
selection problem using both the classical and the Nishii methods with
the $\varphi_\beta$ criterion. We set our maximal order $m$ to 20. The
classical method is a success if it chooses $\hat k = 15$, while the
Nishii method is a success if it chooses $\hat S = \{1, 2, 15\}$.
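
The experiment can be mimicked on a smaller scale with the following Python sketch. It is not the code used for the paper: the coefficients `a_true` are hypothetical (chosen with $\sum_i |a_i| < 1$ so that the process is stationary), the variances are estimated by ordinary least squares rather than by the Yule-Walker equations, and all function names are ours. The criteria are $n\log\hat\sigma^2$ plus the penalty, with the $\varphi_\beta$ penalty taken at $\beta_{min}$, i.e. $\log n \cdot \log\log n$.

```python
import math
import numpy as np


def simulate_ar(a, n, rng, burn_in=500):
    """Simulate X_t = -sum_i a[i-1] * X_{t-i} + E_t, unit-variance Gaussian noise."""
    total = n + burn_in
    e = rng.standard_normal(total)
    x = np.zeros(total)
    for t in range(total):
        past = sum(a[i - 1] * x[t - i] for i in range(1, min(len(a), t) + 1))
        x[t] = e[t] - past
    return x[burn_in:]


def residual_variance(x, lags, m):
    """Least-squares residual variance of x_t regressed on x_{t-i}, i in lags.

    Only t >= m is used so that every candidate model sees the same samples.
    """
    y = x[m:]
    if not lags:
        return float(np.mean(y ** 2))
    z = np.column_stack([x[m - i:len(x) - i] for i in lags])
    coef, *_ = np.linalg.lstsq(z, y, rcond=None)
    return float(np.mean((y - z @ coef) ** 2))


def ic(x, lags, m, alpha):
    """n * log(estimated variance) + (number of lags) * alpha."""
    n_eff = len(x) - m
    return n_eff * math.log(residual_variance(x, lags, m)) + len(lags) * alpha


def classical_order(x, m, alpha):
    """Order k in 1..m minimising the criterion over nested supports {1..k}."""
    return min(range(1, m + 1),
               key=lambda k: ic(x, list(range(1, k + 1)), m, alpha))


def nishii_support(x, m, alpha):
    """Indexes j whose removal from the full model increases the criterion."""
    full = list(range(1, m + 1))
    ic_ref = ic(x, full, m, alpha)
    return {j for j in full
            if ic(x, [i for i in full if i != j], m, alpha) > ic_ref}


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a_true = [0.5, 0.0, 0.3]            # hypothetical model, support {1, 3}
    n = 5000
    x = simulate_ar(a_true, n, rng)
    alpha = math.log(n) * math.log(math.log(n))   # phi_beta penalty at beta_min
    print("classical order:", classical_order(x, 6, alpha))
    print("Nishii support:", sorted(nishii_support(x, 6, alpha)))
```

With such a strong signal and a long series, one would expect the classical method to return order 3 and the Nishii method to return the support {1, 3}, which illustrates the difference between selecting an order and selecting a support.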

2) Results and discussion: Figure 1 shows the percentage
of success of each method for $n = 1000$. The x axis represents
the value of $\beta$ used in the $\varphi_\beta$ criterion. The vertical lines
correspond to the values of $\beta_{AIC}$, $\beta_{BIC}$, $\beta_{min}$ and $\beta_{max}$, always
in that order; see equations (3) and (4).




We note that the AIC criterion often fails, especially with
the Nishii method. The BIC criterion sometimes fails with the
Nishii method, but the $\varphi_{\beta_{min}}$ criterion gives 100% success with
both methods. For small values of the penalty, i.e. $\beta$ close to
0, IC gets close to the ML method and thus overparametrizes the
model. Moreover, the Nishii method is less efficient in this
area because it fails as soon as it keeps a single index in
$[\![3,14]\!] \cup [\![16,20]\!]$, while the classical method only fails if it
chooses an order $\geq 16$. Conversely, for strong values of the penalty, IC
tends to underparametrize the model. This happens here for the
classical method and $\beta \gtrsim 0.45$: it only chooses order 2, thus
missing $a_{15} = 0.45$. The same happens for the Nishii method,
but for $\beta \gtrsim 0.65$: it chooses support $\hat S = \{1, 15\}$, thus missing
the parameter $a_2 = 0.4$, which is the smallest. For $\beta$ close to
1, both methods underparametrize so much that they choose
to keep no parameter at all. The same results are presented
for $n = 100\,000$ in Figure 2; note that $\beta_{AIC} < 0$ as soon as
$n \geq 1619$.
Fig. 2. Percentage of success for both methods, n=100000

Figure 3 presents the prediction error variance (PEV) of the
models chosen by both methods for $0 < \beta < 0.35$, i.e. before
the classical method starts to underparametrize.




Fig. 3. Prediction error variance, n=1000 



Fig. 1. Percentage of success for both methods, n=1000 



The more parameters are kept, the better the model fits the
data and the smaller the PEV. This explains why the PEV grows
with $\beta$ and why it is greater with the Nishii method in the 100%
success zone: the Nishii method sets $a_3 = \dots = a_{14} = 0$
while the classical method estimates them. However, the PEV
with the Nishii method is closer to the real one, $\sigma^2 = 1$. In
that sense, the Nishii method appears to describe the model
more precisely, and the minimization of the PEV, equivalent
to the ML method here, should not be a guideline for model
selection.

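Figure 4 shows, for the same values of $\beta$, the Kullback distance
between the true model $(a, \sigma)$ and the chosen one $(\hat a, \hat\sigma)$:

$$K((a,\sigma),(\hat a,\hat\sigma)) = -\frac{1}{2} + \log\frac{\hat\sigma}{\sigma} + \frac{\sigma^2}{2n\hat\sigma^2}\,\mathrm{Tr}\left({}^t(\hat A A^{-1})(\hat A A^{-1})\right)$$

where $A$ and $\hat A$ are $n \times n$ matrices depending on $a$ and $\hat a$
respectively:

$$A = \begin{pmatrix} 1 & & & \\ a_1 & \ddots & & \\ \vdots & \ddots & \ddots & \\ & \cdots & a_1 & 1 \end{pmatrix}$$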
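
The Kullback distance can be evaluated numerically with the sketch below. It assumes the per-sample form $K = -\frac{1}{2} + \log(\hat\sigma/\sigma) + \frac{\sigma^2}{2n\hat\sigma^2}\mathrm{Tr}({}^t(\hat A A^{-1})(\hat A A^{-1}))$, with $A$ the $n \times n$ lower-triangular Toeplitz matrix having 1 on the diagonal and $a_i$ on the $i$-th subdiagonal; the function names and the example coefficients are ours.

```python
import numpy as np


def ar_matrix(a, n):
    """n x n lower-triangular Toeplitz matrix: 1 on the diagonal,
    a[i-1] on the i-th subdiagonal."""
    mat = np.eye(n)
    for i, coef in enumerate(a, start=1):
        if i < n:
            mat += coef * np.eye(n, k=-i)
    return mat


def kullback(a, sigma, a_hat, sigma_hat, n):
    """Per-sample Kullback distance between (a, sigma) and (a_hat, sigma_hat)."""
    prod = ar_matrix(a_hat, n) @ np.linalg.inv(ar_matrix(a, n))
    trace = np.trace(prod.T @ prod)
    return (-0.5 + np.log(sigma_hat / sigma)
            + sigma ** 2 / (2 * n * sigma_hat ** 2) * trace)


if __name__ == "__main__":
    a_true = [0.5, 0.0, 0.3]      # hypothetical true coefficients
    print(kullback(a_true, 1.0, a_true, 1.0, 50))   # identical models
    print(kullback(a_true, 1.0, [0.5], 1.0, 50))    # a_3 dropped
```

When the two models coincide, $\hat A A^{-1}$ is the identity, the trace equals $n$, and the distance vanishes up to rounding; any other chosen model gives a strictly positive value.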




Fig. 4. Kullback distance to the true model, n=1000 

The Nishii method is seen to give a better description of 
the sought model in terms of Kullback distance. 

B. Two-dimensional autoregression 

1) Presentation: The support of the 2D AR model now
contains pairs of integers. In the literature, the classical
approach is based on supports with different types of geometry
[10]: causal Quarter Plane (QP), causal Non-Symmetrical
Half Plane (NSHP), semi-causal or Non-Causal (NC). As 2D
spectrum estimation methods based on QP support provide
good results [2], we use this type of support here.

Around a site, four QP supports can be defined. But, due to
central symmetry, only two QP are associated with different
sets of AR parameters. The first one is, with order $(k_1, k_2)$:

$$QP1_{k_1,k_2} = \{ (i_1,i_2) \mid 0 \le i_1 \le k_1,\ 0 \le i_2 \le k_2,\ (i_1,i_2) \ne (0,0) \}$$

while the second QP is:

$$QP2_{k_1,k_2} = \{ (i_1,i_2) \mid -k_1 \le i_1 \le 0,\ 0 \le i_2 \le k_2,\ (i_1,i_2) \ne (0,0) \}$$

The classical 2D QP AR model of order $(k_1, k_2)$ is:

$$X_{t_1,t_2} = -\sum_{(i_1,i_2) \in QP_{k_1,k_2}} a_{i_1,i_2} X_{t_1-i_1,t_2-i_2} + E_{t_1,t_2}$$

where QP is either QP1 or QP2. We define $\Theta_{k_1,k_2}$ as the set
of parameters of the 2D QP AR model of order $(k_1, k_2)$, so that
$|\Theta_{k_1,k_2}| = (k_1 + 1) \times (k_2 + 1)$, adding the variance of the
prediction error to the set of AR parameters.

In contrast to the Nishii method, which works as in the
1D case (each parameter associated with a pair of integers
can be tested equal or not to zero), the increment in the
cardinality of nested models is not always one for the classical
method. For example, $\Theta_{k_1,k_2+1}$ and $\Theta_{k_1+1,k_2}$ contain respectively
$(k_1 + 1)$ and $(k_2 + 1)$ more parameters than $\Theta_{k_1,k_2}$. This implies that
some indexes can be rejected by the classical method even if
one of them would have been kept by the Nishii method.
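
The cardinalities discussed above can be checked with a small Python sketch. It assumes that the two quarter planes are the index sets $0 \le i_1 \le k_1$, $0 \le i_2 \le k_2$ and $-k_1 \le i_1 \le 0$, $0 \le i_2 \le k_2$, each without the origin; the function names are ours.

```python
def qp1(k1: int, k2: int):
    """First quarter-plane support of order (k1, k2), origin excluded."""
    return [(i1, i2) for i1 in range(0, k1 + 1) for i2 in range(0, k2 + 1)
            if (i1, i2) != (0, 0)]


def qp2(k1: int, k2: int):
    """Second quarter-plane support of order (k1, k2), origin excluded."""
    return [(i1, i2) for i1 in range(-k1, 1) for i2 in range(0, k2 + 1)
            if (i1, i2) != (0, 0)]


def model_size(k1: int, k2: int) -> int:
    """Number of AR parameters plus one for the prediction error variance."""
    return len(qp1(k1, k2)) + 1


if __name__ == "__main__":
    k1, k2 = 2, 3
    print(model_size(k1, k2))                           # 12 = (2 + 1) * (3 + 1)
    print(model_size(k1, k2 + 1) - model_size(k1, k2))  # 3 = k1 + 1
    print(model_size(k1 + 1, k2) - model_size(k1, k2))  # 4 = k2 + 1
```

Raising either component of the order therefore adds a whole row or column of sites at once, whereas the Nishii method tests each site individually.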

2) Results and discussion: For the simulations, we
used two textures from the Brodatz album [4] (see Figure 5)
in order to show the application of the Nishii method to real
2D processes.




(a) d84 texture 




(b) d29 texture 

Fig. 5. 256x256 textures from the Brodatz album 



We set our maximal order to $(m_1, m_2) = (18, 18)$ and
use the classical and Nishii methods together with the $\varphi_{\beta_{min}}$ criterion
to determine respectively the order and the support of the
autoregression. Figures 6 and 7 present the results; QP1 is shown
on the left of the current site and QP2 on the right.
Fig. 6. Results of classical and Nishii methods on d84 texture

Fig. 7. Results of classical and Nishii methods on d29 texture

Since it has to select rectangular supports, the classical
method keeps sites which are not considered important by the
Nishii method. Conversely, as noted earlier, the Nishii method
keeps sites which are missed by the classical one. In the 1D
synthesized case, we saw in Figure 4 that the Nishii method
gives a more precise description of the model. Here, even
though we did not suppose that our observation actually
comes from a true model, the model selected by the Nishii
method is still more accurate. Moreover, as a perspective, the
shape of the supports chosen by the Nishii method might be a
discriminating factor between different texture images which
might be used, for example, to improve recognition methods.

References

[1] H. Akaike. A New Look at the Statistical Model Identification. IEEE
Transactions on Automatic Control, 19:716-723, 1974.
[2] O. Alata, P. Baylou, and M. Najim. A New 2-D Spectrum Estimate using
Multichannel AR Approach of 2-D Fast RLS Algorithms. In Proc. IEEE
ICIP, pages 442-445, October 1997.
[3] O. Alata and C. Olivier. Choice of a 2-D causal autoregressive texture
model using information criteria. Pattern Recognition Letters,
24(9-10):1191-1201, 2003.
[4] P. Brodatz. Texture: a Photographic Album for Artists and Designers.
New York, Dover, 1966.
[5] A. El Matouat and M. Hallin. Order selection, stochastic complexity
and Kullback-Leibler information. In Athens Conference on Applied
Probability and Time Series Analysis, Vol. II (1995), volume 115 of
Lecture Notes in Statist., pages 291-299. Springer, New York, 1996.
[6] E. J. Hannan and B. G. Quinn. The determination of the order of an
autoregression. J. Roy. Statist. Soc. Ser. B, 41(2):190-195, 1979.
[7] R. Nishii. Maximum likelihood principle and model selection when the
true model is unspecified. J. Multivariate Anal., 27(2):392-403, 1988.
[8] R. Nishii, Z. D. Bai, and P. R. Krishnaiah. Strong consistency of
the information criterion for model selection in multivariate analysis.
Hiroshima Math. J., 18(3):451-462, 1988.
[9] C. Olivier, F. Jouzel, and A. El Matouat. Choice of the number of
component clusters in mixture models by information criteria. Proc.
Vision Interface, pages 74-81, May 1999.
[10] S. Ranganath and A. K. Jain. Two-Dimensional Linear Prediction
Models - part I: Spectral Factorization and Realization. IEEE Transactions
on Acoustics, Speech and Signal Processing, ASSP-33(1):280-299,
February 1985.
[11] J. Rissanen. Stochastic complexity and modeling. Ann. Statist.,
14(3):1080-1100, 1986.
[12] J. Rissanen. Stochastic complexity in statistical inquiry, volume 15 of
World Scientific Series in Computer Science. World Scientific Publishing
Co. Inc., Teaneck, NJ, 1989.
[13] G. Schwarz. Estimating the dimension of a model. Ann. Statist.,
6(2):461-464, 1978.
[14] L. C. Zhao, C. C. Y. Dorea, and C. R. Goncalves. On determination
of the order of a Markov chain. Statistical Inference for Stochastic
Processes, 4(3):273-282, 2001.
[15] L. C. Zhao, P. R. Krishnaiah, and Z. D. Bai. On detection of the number
of signals in presence of white noise. J. Multivariate Anal., 20(1):1-25,
1986.